Feature engineering is the 'art' of formulating useful features from existing data, guided by the target to be learned and the machine learning model being used.
FEATURE ENGINEERING PIPELINE
Why feature engineering?
DEMONSTRATION
import pandas as pd

data = {'Candy Variety': ['Chocolate Hearts', 'Sour Jelly', 'Candy Canes', 'Sour Jelly', 'Fruit Drops'],
        'Date and Time': ['09-02-2020 14:05', '24-10-2020 18:00', '18-12-2020 20:13', '25-10-2020 10:00', '18-10-2020 15:46'],
        'Day': ['Sunday', 'Saturday', 'Friday', 'Sunday', 'Sunday'],
        'Length': [3, 3.5, 3.5, 3.5, 5],
        'Breadth': [2, 2, 2.5, 2, 3],
        'Price': [7.5, 7.6, 8, 7.6, 9]}
df = pd.DataFrame(data)
df['Date and Time'] = pd.to_datetime(df['Date and Time'], format="%d-%m-%Y %H:%M")
df
   Candy Variety     Date and Time        Day       Length  Breadth  Price
0  Chocolate Hearts  2020-02-09 14:05:00  Sunday    3.0     2.0      7.5
1  Sour Jelly        2020-10-24 18:00:00  Saturday  3.5     2.0      7.6
2  Candy Canes       2020-12-18 20:13:00  Friday    3.5     2.5      8.0
3  Sour Jelly        2020-10-25 10:00:00  Sunday    3.5     2.0      7.6
4  Fruit Drops       2020-10-18 15:46:00  Sunday    5.0     3.0      9.0
Which kind of candy is most likely to sell the most on a particular day?
df['Date'] = df['Date and Time'].dt.date
df[['Candy Variety', 'Date']]
   Candy Variety     Date
0  Chocolate Hearts  2020-02-09
1  Sour Jelly        2020-10-24
2  Candy Canes       2020-12-18
3  Sour Jelly        2020-10-25
4  Fruit Drops       2020-10-18
Feature Engineering in action
import numpy as np

df['Weekend'] = np.where(df['Day'].isin(['Saturday', 'Sunday']), 1, 0)
df[['Candy Variety', 'Date', 'Weekend']]
   Candy Variety     Date        Weekend
0  Chocolate Hearts  2020-02-09  1
1  Sour Jelly        2020-10-24  1
2  Candy Canes       2020-12-18  0
3  Sour Jelly        2020-10-25  1
4  Fruit Drops       2020-10-18  1
FEATURE ENGINEERING TECHNIQUES
1) Imputation
Imputation deals with handling missing values in data.
Categorical Imputation: Missing categorical values are generally replaced by the most commonly occurring value (the mode) in the other records.
Numerical Imputation: Missing numerical values are generally replaced by the mean of that feature's values in the other records.
Add missing values
data = {'Candy Variety': ['Chocolate Hearts', 'Sour Jelly', 'Candy Canes', 'Sour Jelly', 'Fruit Drops'],
        'Date and Time': ['09-02-2020 14:05', '24-10-2020 18:00', '18-12-2020 20:13', '25-10-2020 10:00', '18-10-2020 15:46'],
        'Day': ['Sunday', 'Saturday', 'Friday', 'Sunday', 'Sunday'],
        'Length': [3, 3.5, 3.5, 3.5, 5],
        'Breadth': [2, 2, 2.5, 2, 3],
        'Price': [7.5, 7.6, 8, 7.6, 9]}
df = pd.DataFrame(data)
df['Date and Time'] = pd.to_datetime(df['Date and Time'], format="%d-%m-%Y %H:%M")

# Appending a row with missing values
df.loc[len(df.index)] = [np.nan, '22-10-2020 17:24:00', 'Thursday', 3.5, 2, np.nan]
df
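The imputation rules described above can be sketched in pandas as follows. This is a minimal illustration on a small toy frame (the values here are made up for the example, not taken from the demonstration data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Candy Variety': ['Sour Jelly', 'Chocolate Hearts', 'Sour Jelly', np.nan],
    'Price': [7.5, 7.6, 8.3, np.nan],
})

# Categorical imputation: replace missing labels with the most common one (the mode)
df['Candy Variety'] = df['Candy Variety'].fillna(df['Candy Variety'].mode()[0])

# Numerical imputation: replace missing numbers with the mean of the observed values
df['Price'] = df['Price'].fillna(df['Price'].mean())
```

After this, the last row holds 'Sour Jelly' (the mode) and 7.8 (the mean of the three observed prices).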
2) Discretization
Discretization essentially takes a set of data values and groups them together in some logical fashion into bins (or buckets).
METHODS
Grouping of equal intervals
Grouping based on equal frequencies (of observations in the bin)
Grouping based on decision tree sorting (to establish a relationship with target)
Example
df['Type of Day'] = np.where(df['Day'].isin(['Saturday', 'Sunday']), 'Weekend', 'Weekday')
df[['Candy Variety', 'Day', 'Type of Day']]
   Candy Variety     Day       Type of Day
0  Chocolate Hearts  Sunday    Weekend
1  Sour Jelly        Saturday  Weekend
2  Candy Canes       Friday    Weekday
3  Sour Jelly        Sunday    Weekend
4  Fruit Drops       Sunday    Weekend
5  Sour Jelly        Thursday  Weekday
3) Categorical Encoding
Categorical encoding is the technique used to encode categorical features into numerical values which are usually simpler for an algorithm to understand.
One-hot encoding (OHE)
Simply adds a new 0/1 feature for every category, having 1 (hot) if the sample has that category
Can explode if a feature has lots of values, causing issues with high dimensionality
What if test set contains a new category not seen in training data?
Either ignore it (just use all 0’s in row), or handle manually (e.g. resample)
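The "ignore it" option can be sketched in pandas by learning the dummy columns from the training data and aligning the test dummies to them (this is one possible implementation, not necessarily the author's):

```python
import pandas as pd

train = pd.Series(['Weekend', 'Weekday', 'Weekend'])
test = pd.Series(['Weekday', 'Holiday'])  # 'Holiday' never appeared in training

# Learn the dummy columns from the training data only
train_ohe = pd.get_dummies(train)

# Align the test dummies to the training columns:
# the unseen 'Holiday' label becomes an all-zero row
test_ohe = pd.get_dummies(test).reindex(columns=train_ohe.columns, fill_value=0)
```

scikit-learn's OneHotEncoder offers the same behaviour via handle_unknown='ignore'.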
EXAMPLE
for x in df['Type of Day'].unique():
    df[x] = np.where(df['Type of Day'] == x, 1, 0)
df[['Candy Variety', 'Day', 'Type of Day', 'Weekend', 'Weekday']]
   Candy Variety     Day       Type of Day  Weekend  Weekday
0  Chocolate Hearts  Sunday    Weekend      1        0
1  Sour Jelly        Saturday  Weekend      1        0
2  Candy Canes       Friday    Weekday      0        1
3  Sour Jelly        Sunday    Weekend      1        0
4  Fruit Drops       Sunday    Weekend      1        0
5  Sour Jelly        Thursday  Weekday      0        1
Drawback
It can dramatically increase the number of features and create highly correlated features.
Other methods
Count and Frequency encoding: captures each label's representation in the data.
Mean encoding: establishes a relationship with the target.
Ordinal encoding: assigns a number to each unique label.
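These three alternatives can each be done in one line of pandas. A minimal sketch on a small illustrative frame (the data here is made up to mirror the candy example):

```python
import pandas as pd

df = pd.DataFrame({
    'Candy Variety': ['Sour Jelly', 'Chocolate Hearts', 'Sour Jelly', 'Fruit Drops'],
    'Price': [7.6, 7.5, 7.6, 9.0],
})

# Count encoding: replace each label by how often it occurs
df['Variety Count'] = df['Candy Variety'].map(df['Candy Variety'].value_counts())

# Mean (target) encoding: replace each label by the mean target value for that label
df['Variety Mean Price'] = df['Candy Variety'].map(df.groupby('Candy Variety')['Price'].mean())

# Ordinal encoding: assign an integer code to each unique label
df['Variety Code'] = df['Candy Variety'].astype('category').cat.codes
```

Note that mean encoding uses the target, so in practice it must be fitted on training data only to avoid leakage.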
4) Feature Splitting
Splitting features into parts can sometimes improve the value of the features toward the target to be learned.
EXAMPLE
df['Date and Time'] = pd.to_datetime(df['Date and Time'])
df['Date'] = df['Date and Time'].dt.date
df[['Candy Variety', 'Date']]
   Candy Variety     Date
0  Chocolate Hearts  2020-02-09
1  Sour Jelly        2020-10-24
2  Candy Canes       2020-12-18
3  Sour Jelly        2020-10-25
4  Fruit Drops       2020-10-18
5  Sour Jelly        2020-10-22
5) Handling Outliers
Outliers are unusually high or low values in the dataset which are unlikely to occur in normal scenarios.
Since outliers can adversely affect your predictions, they must be handled appropriately. The various methods of handling outliers include:
Removal: The records containing outliers are removed from the distribution. However, if outliers are present across multiple variables, removal could mean losing a large portion of the dataset.
Replacing values: The outliers could alternatively be treated as missing values and replaced using appropriate imputation.
Capping: Capping the maximum and minimum values and replacing them with an arbitrary value or a value from a variable distribution.
Discretization
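Capping is straightforward with pandas' clip. One common recipe, sketched here on made-up prices, uses the interquartile-range fences to decide what counts as an outlier:

```python
import pandas as pd

prices = pd.Series([7.5, 7.6, 8.0, 7.6, 9.0, 45.0])  # 45.0 is a clear outlier

# Cap values outside the interquartile-range fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
capped = prices.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```

Only the 45.0 is altered; it is pulled down to the upper fence while the ordinary prices pass through unchanged.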
6) Variable Transformations
Variable transformation techniques could help with normalizing skewed data.
Logarithmic transformations operate to compress the larger numbers and relatively expand the smaller numbers. This in turn results in less skewed values especially in the case of heavy-tailed distributions.
Other variable transformations include the square root transformation and the Box-Cox transformation, the latter being a generalization of the former two.
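The compressing effect of the log transform is easy to see on a small right-skewed array (the numbers here are illustrative). np.log1p computes log(1 + x), which also behaves safely at zero:

```python
import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 5.0, 1000.0])  # heavy right tail

# log1p = log(1 + x): compresses large values and is safe at zero
log_transformed = np.log1p(skewed)

# Square-root transform: a milder compression
sqrt_transformed = np.sqrt(skewed)
```

The raw extreme is 1000x the smallest value, but after the log transform the spread shrinks to roughly a factor of ten. For Box-Cox, scipy.stats.boxcox estimates the transformation parameter from the data.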
7) Scaling
Feature scaling is done owing to the sensitivity of some machine learning algorithms to the scale of the input values. This technique of feature scaling is sometimes referred to as feature normalization.
TYPES OF SCALING
The commonly used processes of scaling include:
Min-Max Scaling: This process involves the rescaling of all values in a feature in the range 0 to 1. In other words, the minimum value in the original range will take the value 0, the maximum value will take 1 and the rest of the values in between the two extremes will be appropriately scaled.
Standardization/Variance scaling: The mean is subtracted from all the data points and the result is divided by the distribution's standard deviation, arriving at a distribution with a mean of 0 and a variance of 1.
EXAMPLE
Min-max scaling
Scales all features between a given \(min\) and \(max\) value (e.g. 0 and 1)
Makes sense if min/max values have meaning in your data
Robust scaling
Subtracts the median, scales between quantiles \(q_{25}\) and \(q_{75}\)
New feature has median 0, \(q_{25}=-1\) and \(q_{75}=1\)
Similar to the standard scaler, but ignores outliers
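Both scalers from the TYPES OF SCALING section can be written directly against their formulas; a minimal pandas sketch on the candy lengths (scikit-learn's MinMaxScaler and StandardScaler implement the same arithmetic):

```python
import pandas as pd

lengths = pd.Series([3.0, 3.5, 3.5, 3.5, 5.0])

# Min-max scaling: rescale values into [0, 1]
min_max = (lengths - lengths.min()) / (lengths.max() - lengths.min())

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (lengths - lengths.mean()) / lengths.std(ddof=0)
```

The minimum length maps to 0 and the maximum to 1, while the standardized series ends up with mean 0 and unit standard deviation.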
8) Creating Features
Feature creation involves deriving new features from existing ones.
This can be done with simple mathematical operations, such as aggregations to obtain the mean, median, mode, sum, or difference, and even the product of two values.
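On the candy data, two natural derived features are the product of the dimensions and a ratio against the price (the column names 'Area' and 'Price per Area' are invented here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Length': [3.0, 3.5, 3.5, 3.5, 5.0],
    'Breadth': [2.0, 2.0, 2.5, 2.0, 3.0],
    'Price': [7.5, 7.6, 8.0, 7.6, 9.0],
})

# Product of two existing features: an approximate candy area
df['Area'] = df['Length'] * df['Breadth']

# Ratio of two features: price per unit of area
df['Price per Area'] = df['Price'] / df['Area']
```

Whether such derived features help is an empirical question; they are worth keeping only if they improve the model on validation data.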